NA values were removed from the three gene expression data frames (the pan-cancer data and the tumor and normal tissue data). Because the dimensions of the data frames did not change during this step, it was concluded that the data sets contained no NA values.
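The dimension check described above can be sketched as follows. This is a minimal illustration, not the code used in the analysis: it assumes that removal was row-wise (one row per gene), and the toy matrix and function names are hypothetical.

```python
import math

def is_missing(value):
    """Treat None and float NaN as missing (a stand-in for R's NA)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def drop_rows_with_missing(matrix):
    """Return only the rows (genes) that contain no missing values."""
    return [row for row in matrix if not any(is_missing(v) for v in row)]

# Hypothetical toy matrix: rows are genes, columns are patients.
expression = [
    [2.1, 3.4, 0.0],
    [5.0, 4.8, 5.2],
]

cleaned = drop_rows_with_missing(expression)

# If the number of rows is unchanged, no row contained a missing value.
unchanged = len(cleaned) == len(expression)
print("dimensions unchanged:", unchanged)
```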
The goal of the analysis was to identify genes whose expression differs significantly between tumor types (pan-cancer analysis) or between normal and tumor tissue (THCA analysis). Genes with similar expression in all patients are therefore not relevant; most of these are probably housekeeping genes.
The histogram of the log-transformed variances of the pan-cancer data is displayed in . A threshold of -1 was set, and all genes with a lower log-variance were omitted. This reduced the number of genes from 60,000 to 19,000.
Low-variance filtering of the THCA data set was done in a similar way. The gene expression data of the tumor tissue were used to compute the log-transformed variance of each gene. Genes with a log-variance below -1.25 were removed from both the tumor tissue and the normal tissue data. This reduced the number of genes from about 20,000 to 15,000 in both data frames.
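The low-variance filter can be illustrated with a short sketch. The text does not state the base of the logarithm or the variance estimator; the natural logarithm and the sample variance are assumed here, and the gene names and values are hypothetical.

```python
import math
import statistics

def genes_above_threshold(expression, threshold):
    """Return the genes whose log-variance is at least `threshold`.

    expression: dict mapping gene id -> list of expression values.
    Assumption: natural log of the sample variance (n - 1 denominator).
    Genes with zero variance have no finite log-variance and are dropped.
    """
    kept = []
    for gene, values in expression.items():
        variance = statistics.variance(values)
        if variance > 0 and math.log(variance) >= threshold:
            kept.append(gene)
    return kept

# Hypothetical toy data: the filter is derived from the tumor tissue only.
tumor = {
    "gene_a": [1.0, 1.1, 0.9],   # nearly constant, low variance
    "gene_b": [0.0, 2.0, 4.0],   # variable
    "gene_c": [3.0, 3.0, 3.0],   # constant, variance 0
}
normal = {
    "gene_a": [1.0, 1.0, 1.2],
    "gene_b": [0.5, 0.4, 0.6],
    "gene_c": [3.0, 3.1, 2.9],
}

kept = genes_above_threshold(tumor, threshold=-1.25)

# The same gene set is then applied to both data frames.
tumor_filtered = {g: tumor[g] for g in kept}
normal_filtered = {g: normal[g] for g in kept}
print(kept)
```

Deriving the gene list from one data frame and applying it to both keeps the tumor and normal matrices aligned gene by gene, which the later tumor-versus-normal comparison relies on.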
The biotype was determined for the genes of the selected metabolic pathways, the genes of the hallmark pathways, and the genes of the gene expression matrix, so that only genes of the same biotype were kept. Because most genes are protein-coding, only protein-coding genes were retained.
In the mean-variance plot, displayed in Figure @ref(fig:showmeanvariance), genes with a very high variance and non-zero mean were annotated with their Ensembl IDs.
Mean-variance plot of cleaned TCGA expression data. The y-axis shows the variance of each gene's expression, the x-axis its mean.
For the descriptive analysis, violin plots were created for five tumor types using the gene expression data from the TCGA matrix, to compare the distributions of their expression values. The violin plots are shown in Figure @ref(fig:showviolinplots). The white point in the middle of each plot marks the median (50% quantile). For all tumor types it lies between the gene expression values 0 and 5, which is also where most genes fall for every tumor type. Towards the top and bottom the curve flattens, because only a few genes are expressed at very high or very low levels. The distributions of all five tumor types are very similar, which suggests that the other 28 tumor types in this data set are distributed in a similar way.
Violin plots of cleaned TCGA expression data for five tumor types
For the volcano plot, the THCA data from the data set for the focused analysis were used. The volcano plot is displayed in Figure @ref(fig:showvolcanoplot). Genes that are not significantly differentially expressed are marked green, significantly overexpressed genes blue, and significantly underexpressed genes red. The genes with very low p-values differ the most between tumor and normal tissue and are annotated with their names.
Volcano plot of THCA expression data
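The color coding of the volcano plot can be sketched as a simple classification rule. The significance level and the minimum absolute log2 fold change are not given in the text; the values 0.05 and 1.0 below are assumptions, as are the gene names and numbers.

```python
import math

# Colors as described for the volcano plot.
COLORS = {"over": "blue", "under": "red", "not significant": "green"}

def classify_gene(log2_fold_change, p_value, alpha=0.05, min_abs_lfc=1.0):
    """Classify a gene for the volcano plot.

    alpha and min_abs_lfc are assumed thresholds, not values from the analysis.
    """
    if p_value < alpha and log2_fold_change >= min_abs_lfc:
        return "over"
    if p_value < alpha and log2_fold_change <= -min_abs_lfc:
        return "under"
    return "not significant"

def volcano_point(log2_fold_change, p_value):
    """Return the (x, y) position: log2 fold change against -log10(p)."""
    return log2_fold_change, -math.log10(p_value)

# Hypothetical genes: (name, log2 fold change tumor vs. normal, p-value).
genes = [
    ("gene_up", 2.3, 0.0001),
    ("gene_down", -1.8, 0.002),
    ("gene_flat", 0.2, 0.4),
    ("gene_small_effect", 0.5, 0.001),
]

for name, lfc, p in genes:
    label = classify_gene(lfc, p)
    x, y = volcano_point(lfc, p)
    print(f"{name}: x={x:.2f}, y={y:.2f}, color={COLORS[label]}")
```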
The results of the GSEA for THCA tissue can be seen in Figure @ref(fig:GSEAHeat).
GSEA performed on the THCA expression data, annotated with pathway type, histological type, and cluster.
The patients are arranged horizontally, the pathways vertically. The heatmap shows the intensity of expression of each pathway in each patient: red pathways are overexpressed, blue pathways are underexpressed. The axes were annotated with the pathway type (hallmark or metabolic), the histological type of the tumor, and the cluster. The patients form three main clusters, which can be explained by similarities in pathway activity. The clusters may reflect differences in pathway activity between the histological types: cluster 3 contains only the classical type, whereas tall cell thyroid cancers occur mainly in cluster 2 and follicular thyroid cancers mainly in cluster 1.
To confirm the cluster formation, the same analysis was carried out for the THCA data from the large gene expression data frame. In this analysis, too, three clusters formed, which can be attributed in part to the histological types. xxx
To display the results of the GSVA of the metabolic and hallmark pathways in the gene expression data frame, a heatmap annotated with cancer type, histological type, and pathway type was created. Although the individual pathways are not annotated, similarities in pathway activity within groups of patients and tumor types are visible, and two observations were made. First, three clusters were detected in the THCA expression data; these are analysed further in the next steps. Second, most tumor types formed clear clusters, while others did not seem to cluster by tumor type but rather by histological type. This observation is also analysed further.
Results of GSVA, annotated with histological type, cancer type, pathway type and clusters
The observations from the GSVA heatmap were checked with a second heatmap displaying the mean expression of each pathway in each tumor type, annotated with histological type and pathway type.
The mean expression values were derived from the large matrix of 10,000 patients by averaging the expression of each pathway over all patients with the same cancer type; this shows, for example, that pathway xy is on average downregulated in a given cancer. These means are displayed in the smaller heatmap. From this one can say that, for example, all glioblastomas look similar: differences in gene expression profiles depend not on the individual patient but only on the histological type. In addition, the hallmark genes appear whiter because they are expressed at roughly the same level in all cancers, which confirms the hallmarks (the corresponding figure still needs to be shown here).
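Averaging pathway scores per cancer type, as used for the smaller heatmap, can be sketched as follows. The data layout (one list of scores per pathway, one cancer type label per patient) and all names and values are hypothetical; the labels GBM and THCA only serve as example cancer types.

```python
from collections import defaultdict
from statistics import mean

def mean_activity_by_type(scores, cancer_types):
    """Average each pathway's scores over all patients of the same cancer type.

    scores: dict mapping pathway -> list of scores, one per patient.
    cancer_types: list of cancer type labels, in the same patient order.
    Returns: dict mapping pathway -> {cancer type: mean score}.
    """
    result = {}
    for pathway, values in scores.items():
        grouped = defaultdict(list)
        for value, cancer_type in zip(values, cancer_types):
            grouped[cancer_type].append(value)
        result[pathway] = {t: mean(v) for t, v in grouped.items()}
    return result

# Hypothetical toy input: four patients, two cancer types, two pathways.
cancer_types = ["GBM", "GBM", "THCA", "THCA"]
scores = {
    "pathway_1": [1.0, 3.0, -2.0, -4.0],
    "pathway_2": [0.0, 0.0, 0.5, 1.5],
}

means = mean_activity_by_type(scores, cancer_types)
for pathway, by_type in means.items():
    print(pathway, by_type)
```

A negative mean for a pathway within one cancer type corresponds to the statement that the pathway is, on average, downregulated in that cancer.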
The PCA was performed on the pathway activities and, reduced to the first two PCs, is shown in Figure @ref(fig:PCAPanType). Clusters are hinted at, but no clear structure is visible. The composition of the PCs and the variance explained by the individual variables can be taken from Figure @ref(fig:). DESCRIPTION OF THE PCA AND THE VARIANCE DISTRIBUTION. The plot was also colored by the form of the tumor and is shown in Figure @ref(fig:PCAPanForm). Here a clearer structure is visible than when coloring by tumor type. This suggests that at least the pathways contributing to PC1 and PC2 are related to the form of the tumor.
To check the results, the PCA was also applied to the gene activity; no difference was observed.
PCA of TCGA expression data, colored by tumor type
PCA of TCGA expression data, colored by form of tumor
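The reduction to the first two PCs can be illustrated with a small, dependency-free sketch. It uses power iteration with deflation on the covariance matrix, which is a swapped-in technique chosen for brevity and not necessarily the routine used in the analysis; the toy matrix (patients in rows, pathway activities in columns) is hypothetical.

```python
import math

def center_columns(rows):
    """Subtract each column's mean so every variable has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def covariance_matrix(centered):
    """Sample covariance matrix (n - 1 denominator) of centered data."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def mat_vec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def leading_eigenpair(matrix, iterations=500):
    """Power iteration: largest eigenvalue and its unit eigenvector.

    The fixed start vector must not be orthogonal to the leading
    eigenvector; that holds for the toy data below.
    """
    p = len(matrix)
    vec = [float(i + 1) for i in range(p)]
    norm = math.sqrt(sum(v * v for v in vec))
    vec = [v / norm for v in vec]
    for _ in range(iterations):
        nxt = mat_vec(matrix, vec)
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break
        vec = [v / norm for v in nxt]
    eigenvalue = sum(a * b for a, b in zip(vec, mat_vec(matrix, vec)))
    return eigenvalue, vec

def pca_two_components(rows):
    """Return per-patient scores on PC1/PC2 and the explained variance ratios."""
    centered = center_columns(rows)
    cov = covariance_matrix(centered)
    p = len(cov)
    total_variance = sum(cov[i][i] for i in range(p))
    components, explained = [], []
    for _ in range(2):
        value, vec = leading_eigenpair(cov)
        components.append(vec)
        explained.append(value / total_variance)
        # Deflation: remove the component just found from the matrix.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(p)]
               for i in range(p)]
    scores = [[sum(v * c for v, c in zip(row, comp)) for comp in components]
              for row in centered]
    return scores, explained

# Hypothetical toy data: four patients, three pathway activities.
activities = [
    [2.0, 1.9, 0.1],
    [1.0, 1.1, -0.1],
    [-1.0, -0.9, 0.2],
    [-2.0, -2.1, -0.2],
]

scores, explained = pca_two_components(activities)
print("explained variance ratios:", [round(e, 3) for e in explained])
```

The explained variance ratios indicate how much of the total variance the two plotted axes retain; if they are low, a lack of visible clusters in the PC1/PC2 plane does not rule out structure in later components.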
Because the PCA did not reveal clear clusters, its results were used to compute a UMAP. The results are shown in Figure @ref(fig:UMAPPanType). Here distinct clusters are visible. Although the distances between clusters are not proportional to the true differences, this indicates that the individual tumor types have characteristic pathway activities. In the middle there is a large cluster that cannot be clearly assigned to any tumor type. For this reason, a UMAP colored by form was also created.
The results are shown in Figure @ref(fig:UMAPPanForm). The central cluster could not be assigned to one specific histological type, but it consists largely of squamous cell carcinoma, transitional cell carcinoma, and other carcinomas. Different histological types thus show partly different expression patterns, whereas the various types of carcinoma have similar expression patterns.
UMAP of TCGA expression data, colored by tumor type
UMAP of TCGA expression data, colored by form of the tumor
PCA was also performed for the focused analysis, to find subgroups within the thyroid tumors that are related to pathway activity or gene activity. For this, the results of the GSEA were used first. Finding clusters proved more difficult here, because neither the thyroid tumor type nor the stage showed a clear association.
A figure x was generated for the THCA gene expression data obtained from the gene expression data frame; it can be seen in Figure @ref(fig:figurex). It allowed the pathways with the lowest p-values to be identified. The thyroxine biosynthesis pathway is a downregulated pathway with a low p-value and is going to be predicted by linear regression analysis and the neural network.
Figure X. -log10(p-values) plotted against ranked p-values, obtained from GSVA on THCA expression data from the gene expression data frame. Left: downregulated pathways; right: upregulated pathways.
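The coordinates of such a ranked p-value plot can be computed as in the following sketch. How the direction of regulation was determined is not stated in the text; the sign of a per-pathway effect value is assumed here, and all pathway names except thyroxine biosynthesis, as well as all numbers, are hypothetical.

```python
import math

def ranked_points(pathways):
    """Split pathways by direction and rank them by p-value.

    pathways: list of (name, effect, p_value); a negative effect is
    treated as downregulated (assumption).
    Returns: {"down": [...], "up": [...]}, each a list of
    (rank, name, -log10(p)), with rank 1 being the smallest p-value.
    """
    panels = {"down": [], "up": []}
    for name, effect, p_value in pathways:
        direction = "down" if effect < 0 else "up"
        panels[direction].append((name, p_value))
    ranked = {}
    for direction, entries in panels.items():
        entries.sort(key=lambda entry: entry[1])
        ranked[direction] = [
            (rank, name, -math.log10(p_value))
            for rank, (name, p_value) in enumerate(entries, start=1)
        ]
    return ranked

# Hypothetical input values for illustration only.
pathways = [
    ("thyroxine biosynthesis", -0.8, 1e-12),
    ("pathway_b", -0.3, 1e-4),
    ("pathway_c", 0.5, 1e-8),
    ("pathway_d", 0.1, 0.02),
]

panels = ranked_points(pathways)
for direction, points in panels.items():
    for rank, name, y in points:
        print(f"{direction}: rank={rank}, {name}, -log10(p)={y:.1f}")
```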